Adaptive Grids for Clustering Massive Data Sets
Abstract
Clustering is a key data mining problem. Density- and grid-based techniques are a popular way to mine clusters in a large multi-dimensional space, where clusters are regarded as regions that are denser than their surroundings. Clusters are characterized by their attributes and the value ranges of those attributes. Fine grid sizes lead to a huge amount of computation, while coarse grid sizes reduce the quality of the clusters found. Different grid sizes also yield clusters with different cluster descriptions. The technique of adaptive grids makes it possible to choose grids based on the data distribution and does not require the user to specify parameters such as the grid size or the density thresholds. Further, clusters may be embedded in a subspace of a high-dimensional space. We propose a modified bottom-up subspace clustering algorithm to discover clusters in all possible subspaces. Our method scales linearly with the data dimensionality and the size of the data set. Experimental results on a wide variety of synthetic and real data sets demonstrate the effectiveness of adaptive grids and the effect of the modified subspace clustering algorithm. Our algorithm explores at least an order of magnitude more subspaces than the original algorithm, and the use of adaptive grids yields, on average, a speedup of two orders of magnitude compared to the method with user-specified grid size and threshold.
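To make the idea of data-driven grids concrete, here is a minimal sketch for a single dimension. It is not the authors' algorithm: it starts from a fine uniform histogram and merges adjacent cells whose counts are similar, so that bin boundaries fall where the density changes. The function name `adaptive_bins` and the knobs `n_fine` and `merge_tol` are hypothetical and exist only for illustration; the abstract states that the actual method needs no user-specified grid size or threshold.

```python
from typing import List, Sequence, Tuple


def adaptive_bins(
    values: Sequence[float],
    lo: float,
    hi: float,
    n_fine: int = 20,       # hypothetical: number of fine starting cells
    merge_tol: float = 0.2  # hypothetical: relative tolerance for merging
) -> List[Tuple[float, float, int]]:
    """Return variable-width bins (left, right, count) for one dimension.

    Assumes every value lies in [lo, hi] and hi > lo.
    """
    width = (hi - lo) / n_fine

    # Step 1: fine uniform histogram.
    counts = [0] * n_fine
    for v in values:
        idx = min(int((v - lo) / width), n_fine - 1)  # clamp v == hi
        counts[idx] += 1

    # Step 2: sweep left to right, merging a cell into the current bin
    # when its count is close to the bin's mean count per fine cell.
    bins = [[lo, lo + width, counts[0]]]
    cells_in_bin = [1]
    for i in range(1, n_fine):
        left, _, total = bins[-1]
        mean = total / cells_in_bin[-1]
        c = counts[i]
        right = lo + (i + 1) * width
        if abs(c - mean) <= merge_tol * max(c, mean, 1):
            bins[-1] = [left, right, total + c]
            cells_in_bin[-1] += 1
        else:
            bins.append([lo + i * width, right, c])
            cells_in_bin.append(1)

    return [(l, r, c) for l, r, c in bins]


if __name__ == "__main__":
    data = [2.5] * 100 + [3.5] * 100
    for b in adaptive_bins(data, 0.0, 10.0, n_fine=10):
        print(b)
```

With this kind of grid, a uniformly sparse stretch collapses into one wide bin while a dense stretch gets its own bin, which is the intuition behind avoiding both overly fine and overly coarse uniform grids.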
Similar resources
Clustering Algorithm for 2D Multi-Density Large Dataset Using Adaptive Grids
Clustering is a key data mining problem. Density-based clustering algorithms have recently gained popularity in the data mining field. Density- and grid-based techniques are a popular way to mine clusters in large spatial datasets, where clusters are regarded as regions that are denser than their surroundings. The attribute values and ranges of these attributes characterize the clusters. In this paper we a...
Mafia: Efficient and Scalable Subspace Clustering for Very Large Data Sets (Center for Parallel and Distributed Computing)
Clustering techniques are used in database mining for finding interesting patterns in high dimensional data. These are useful in various applications of knowledge discovery in databases. Some challenges in clustering for large data sets in terms of scalability, data distribution, understanding end-results, and sensitivity to input order, have received attention in the recent past. Recent approach...
A Scalable Parallel Subspace Clustering Algorithm for Massive Data Sets
Clustering is a data mining problem that finds dense regions in a sparse multi-dimensional data set. The attribute values and ranges of these regions characterize the clusters. Clustering algorithms need to scale with the database size and also with the large dimensionality of the data set. Further, these algorithms need to explore the embedded clusters in a subspace of a high dimensional spa...
PARALLEL ALGORITHMS FOR CLUSTERING HIGH-DIMENSIONAL LARGE-SCALE DATASETS Harsha
Clustering techniques for large scale and high dimensional data sets have attracted great interest in recent literature. Such data sets are found both in scientific and commercial applications. Clustering is the process of identifying dense regions in a sparse multi-dimensional data set. Several clustering techniques proposed earlier either lack in scalability to a very large set of dimensions or to...
High Performance Subspace Clustering for Massive Data Sets
Business establishments collect vast amounts of data every day. Leveraging this data for smart decision making is the key to identifying profit opportunities, customer retention and giving a winning touch to the business. The path from large amounts of data to Knowledge Discovery is Information Mining, using a sophisticated set of tools to uncover associations, patterns, and trends; detect devia...